An Intelligent System to Sense Textual Cues for Location Assistance in Autonomous Vehicles

The current technological world is growing rapidly, and each aspect of life is being transformed toward automation for human comfort and reliability. With autonomous vehicle technology, the communication gap between the driver and the traditional vehicle is being reduced through multiple technologies and methods. In this regard, state-of-the-art methods have proposed several approaches for advanced driver assistance systems (ADAS) to meet the requirement of a level-5 autonomous vehicle. Consequently, this work explores the role of textual cues present in the outer environment for finding the desired locations and advising the driver where to stop. Firstly, the driver inputs the keywords of the desired location to the proposed system. Secondly, the system starts sensing the textual cues present in the outer environment through natural language processing techniques. Thirdly, the system keeps matching the keywords input by the driver against those found in the outer environment using similarity learning. Whenever the system finds a location having any similar keyword in the outer environment, the system informs the driver, slows down, and applies the brake to stop. The experimental results on four benchmark datasets show the efficiency and accuracy of the proposed system for finding the desired locations by sensing textual cues in autonomous vehicles.


Introduction
Autonomous vehicles (AV) have gained significant popularity in recent years due to the vast revolution in modern transportation systems. An autonomous vehicle is a self-driving vehicle that perceives its outer environment and moves with no or very limited human involvement. Various renowned reports and surveys predict that by 2030, autonomous vehicles will be capable and reliable enough to replace most human driving [1,2]. In this scenario, many new methods are being proposed to facilitate autonomous vehicles' vision perception, sensing of the outer environment, safety aspects, traffic laws and regulations, accident liability, and maintenance of the surrounding map [3][4][5].
An autonomous vehicle relies on multiple sensors, complex algorithms, actuators, machine learning tools, computer vision techniques, and reliable processors to operate [6,7]. The autonomous vehicle perceives the outer environment with the help of numerous sensors and makes decisions with the assistance of computer vision [8,9]. Each sensor's configuration and mechanism vary; as an example, a sideslip angle estimation algorithm for autonomous vehicles is proposed in [10]. The algorithm is based on a consensus Kalman filter that fuses measurements from a reduced inertial navigation system (R-INS), a global navigation satellite system (GNSS), and a linear vehicle-dynamic-based sideslip estimator.
Since the last few decades, advanced driver assistance systems (ADAS) are equally appreciated to avoid traffic accidents and to improve driving comfort in autonomous vehicles [11]. The ADAS systems are safe and secure systems designed to decrease the model-based vehicle slip angle (VSA) estimation method that fuses information from a GNSS and an IMU. The method is designed to be robust against the effects of vehicle roll and pitch, a low sampling rate of GNSS, and GNSS signal delay.
Since the early days of mechanical vehicles, safety has been one of the key concerns in automotive systems. Several attempts have been made to address safety concerns by developing safe and secure systems to protect the driver as well as to prevent injury to pedestrians [29,30]. Safety is also at stake when the driver of an autonomous vehicle is preoccupied with searching for the desired location to stop. With our proposed system, the safety of the AV can be increased drastically since the AV will automatically recognize the desired locations in its surroundings. Rather than requiring the driver to search continually for the desired locations, our proposed system will automatically recognize the textual cues present in the outer environment and suggest that the driver stop.
While driving on the road, an AV performs multiple operations such as lane change, lane keeping, overtaking, and following the traffic rules. Several studies have proposed and developed numerous methods for ADAS [31,32]. It is equally important for an autonomous vehicle to be aware of the textual cues appearing in its outer environment in order to make decisions or assist the driver, either to stop or to keep driving. Thus, the key concern of this paper is to reduce human intervention in autonomous vehicles.
In this paper, we propose a novel intelligent system based on the driver's instruction for finding the desired locations using textual cues present in the outer environment for advanced driver assistance systems. For this, we combine computer vision and natural language processing (NLP) techniques to perceive textual cues. Computer vision methods train the system to interpret and perceive the visual world around the autonomous vehicle, and NLP techniques equip the system with the ability to read, recognize, and derive the meaning from textual cues appearing in front of an autonomous vehicle. The key contributions of this paper are as follows:
• A novel intelligent system is proposed for AVs to find unsupervised locations.
• The proposed system is capable of sensing the textual cues that appear in the outer environment for determining desired locations.
• The proposed system is a novel development in the list of ADAS features of an autonomous vehicle.
• With the proposed system, the driver's efforts for finding the desired locations will drastically be decreased.
The remainder of the paper is organized as follows. In Section 2, we describe the proposed system for finding the desired locations with textual cues and their formation as keywords. In Section 3, the experimental results are presented to show the efficiency and accuracy of the proposed system. Section 4 concludes the proposed work and presents the future directions.

Proposed System
In this section, we propose a novel intelligent system to find the desired locations using textual cues for an autonomous vehicle. Firstly, the driver inputs one or more keywords to the proposed system to find the desired locations. Secondly, the proposed system detects and localizes the textual cues appearing in the outer environment. The system will generate the keywords localized from the outer environment with detection and recognition methods. Finally, the system will execute similarity learning to find the similarity between the input keywords and the localized keywords from outer environment images. The schematic diagram of the proposed intelligent system is shown in Figure 1.

Textual Cues Detection
In order to detect, localize, and form the keywords from the outer environment, we employ a text detection and localization technique. Firstly, we use an affine transformation to deal with global distortion appearing within an input image and to rectify the text toward a more horizontal orientation. It takes an input image $I_i \in \mathbb{R}^{C_i \times H_i \times W_i}$ with channel $C_i$, height $H_i$, and width $W_i$ to produce an output image $I_g$. The affine transformation based on the arguments between the input image $I_i$ and output image $I_g$ is given as: where $(x_i, y_i)$ are the source coordinates of the input image and $(x_g, y_g)$ are the required coordinates for the output image $I_g$. The output image $I_g$ is further rectified from the input image $I_i$ using bilinear interpolation, given as: where $I_g(x_g, y_g)$ is the pixel value of the rectified image $I_g$ at the location $(x_g, y_g)$ and $I_{i(n,m)}$ is the pixel value of the input image $I_i$ at the location $(n, m)$.
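The rectification step can be illustrated with a minimal NumPy sketch. It assumes the common formulation in which a 2×3 affine matrix (called `theta` here, a hypothetical name) maps each output coordinate $(x_g, y_g)$ to a source coordinate $(x_i, y_i)$, and the output pixel is a bilinear blend of the neighbouring source pixels with weights $\max(0, 1-|x_i-m|)\max(0, 1-|y_i-n|)$. The function name and the zero-padding outside the image are assumptions for illustration, not details taken from the paper.

```python
import numpy as np


def affine_warp_bilinear(src, theta, out_shape):
    """Warp a single-channel image with a 2x3 affine matrix `theta`.

    Each output pixel (x_g, y_g) is mapped by `theta` to source
    coordinates (x_i, y_i); its value is a bilinear blend of the four
    neighbouring source pixels. Samples outside the image count as zero.
    """
    src = np.asarray(src, dtype=float)
    h_in, w_in = src.shape
    h_out, w_out = out_shape

    # Output grid in homogeneous coordinates (x_g, y_g, 1).
    ys, xs = np.meshgrid(np.arange(h_out), np.arange(w_out), indexing="ij")
    grid = np.stack([xs.ravel(), ys.ravel(), np.ones(xs.size)])
    x_i, y_i = np.asarray(theta, dtype=float) @ grid

    x0 = np.floor(x_i).astype(int)
    y0 = np.floor(y_i).astype(int)
    out = np.zeros(xs.size)
    for dy in (0, 1):
        for dx in (0, 1):
            n, m = y0 + dy, x0 + dx
            # Bilinear weight of source pixel (n, m) for this sample.
            wgt = (np.maximum(0, 1 - np.abs(x_i - m))
                   * np.maximum(0, 1 - np.abs(y_i - n)))
            valid = (n >= 0) & (n < h_in) & (m >= 0) & (m < w_in)
            out[valid] += wgt[valid] * src[n[valid], m[valid]]
    return out.reshape(h_out, w_out)
```

With the identity matrix `[[1, 0, 0], [0, 1, 0]]` the image is returned unchanged; a rotation or shear in the left 2×2 block would deskew slanted text.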

Textual Candidates Detection
The textual candidates detection aims to extract the position of textual regions in the outer environment. Since the text appearing in the outer environment generally has diverse contrast to its relative background and uniform color intensity, the maximally stable extremal region (MSER) technique is well suited, as it is widely used and considered the best region detector [33]. In order to detect the textual candidates appearing in the outer environment, we adopt the MSER approach for finding the corresponding candidates within the input image $I_g(x_g, y_g)$. For finding the extremal regions in the input image, the intensity difference is given as: where $|R|$ represents the extracted extremal regions area, $R(+\Delta)$ represents the extremal regions, $+\Delta$ specifies the increment of each extremal region $R$, and $|R(+\Delta) - R|$ shows the area difference between the two regions. After applying the region detector, the obtained extremal regions are shown in Figure 2.
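To make the stability idea concrete, the following sketch computes the relative area growth $|R(+\Delta) - R| / |R|$ of a region when the intensity threshold is raised by $\Delta$. This ratio is an assumed form built from the symbols defined above, and the flood-fill helper, seed pixel, and function names are hypothetical. A full MSER detector tracks all regions across all thresholds; this only evaluates one region.

```python
from collections import deque

import numpy as np


def region_area(img, seed, level):
    """Area of the 4-connected region around `seed` with intensity <= level."""
    if img[seed] > level:
        return 0
    h, w = img.shape
    seen = {seed}
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < h and 0 <= nc < w
            if inside and (nr, nc) not in seen and img[nr, nc] <= level:
                seen.add((nr, nc))
                queue.append((nr, nc))
    return len(seen)


def stability(img, seed, level, delta):
    """Relative growth |R(+delta) - R| / |R|; values near 0 mean stable."""
    area = region_area(img, seed, level)
    if area == 0:
        raise ValueError("seed pixel is not inside a region at this level")
    # Regions are nested, so the area difference is a plain subtraction.
    grown = region_area(img, seed, level + delta)
    return (grown - area) / area
```

A dark character on a bright sign keeps the same area over a wide range of thresholds, so its ratio stays near zero until the threshold reaches the background intensity.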

Textual Candidates Filtering
The textual regions detected in the previous step using the MSER technique are further refined and rectified. First, we validate the size and the aspect ratio using geometric properties for textual candidates filtering, which is given as: where $h$ and $w$ are the height and width of the aligned bounding box of segmented axes, respectively, and $h_{min}$, $h_{max}$, $r_{min}$, and $r_{max}$ are parameters to fine-tune.
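A minimal sketch of such a geometric check is shown below. It assumes the filter keeps a candidate when its height lies in $[h_{min}, h_{max}]$ and its aspect ratio $w/h$ lies in $[r_{min}, r_{max}]$; the exact inequality, the `Box` type, and the threshold values in the usage note are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned bounding box of one textual candidate (hypothetical type)."""
    w: float
    h: float


def passes_geometry(box, h_min, h_max, r_min, r_max):
    """Keep a candidate only if its height and aspect ratio are plausible."""
    if box.h <= 0:
        return False
    ratio = box.w / box.h  # aspect ratio, assumed here to be width / height
    return h_min <= box.h <= h_max and r_min <= ratio <= r_max


def filter_candidates(boxes, h_min, h_max, r_min, r_max):
    """Return the candidates that satisfy the geometric constraints."""
    return [b for b in boxes if passes_geometry(b, h_min, h_max, r_min, r_max)]
```

For instance, with illustrative thresholds `h_min=8, h_max=300, r_min=0.1, r_max=10`, a 400×5 box (a thin horizontal line) would be rejected while a 20×30 box would be kept.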
The input image $(x_d, y_d) \in d$ having the size $W \times H$ and the predicted categorized result $\alpha$ and $\alpha$ with uncertain probability sequence $p$ and $p$ is given as: where $D$ and $K$ represent the character sequence length. The input vector $V$ is combined using the following properties: The above four properties are probability characteristics in which mean represents the overall confidence score and min represents the least likely character. Furthermore, where $\theta$ is the constant parameter. The above two properties are used to normalize the number of characters between 0 and 1, and the following two properties are used for character width calculated as per geometric properties: where $\kappa$ represents the character width. The localized regions which satisfy the above properties are then further processed and the remaining regions are discarded, as shown in Figure 3.
The obtained localized regions may still contain non-textual regions and may produce false recognition results. We further segment textual regions with the stroke responses of each image pixel. The corner points are used as the edges of two strokes. The corner points and stroke points establish the distortion of strokes. For this, we follow the corner detection approach [34], which applies the following selection criteria. Firstly, the matrix $M$ for each pixel is calculated as follows: where $w(x, y)$ represents the weight at position $(x, y)$ for the window center, and $I_x$ and $I_y$ denote the gradient values of the pixel at position $(x, y)$. The eigenvalues $\lambda_1$ and $\lambda_2$ of the matrix $M$ are calculated as:
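The corner selection step can be sketched as follows, assuming the corner detector of [34] uses the usual gradient second-moment matrix $M = \sum w(x, y) \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix}$. The uniform window weights $w(x, y) = 1$, the window radius, and the function names are assumptions. The eigenvalues are computed with the closed form for a symmetric 2×2 matrix.

```python
import numpy as np


def box_sum(arr, radius):
    """Sum over a (2r+1) x (2r+1) window, zero padded, uniform weights."""
    padded = np.pad(arr, radius)
    out = np.zeros_like(arr)
    size = 2 * radius + 1
    for dy in range(size):
        for dx in range(size):
            out += padded[dy:dy + arr.shape[0], dx:dx + arr.shape[1]]
    return out


def second_moment_eigenvalues(img, radius=1):
    """Per-pixel eigenvalues (lam1 >= lam2) of the gradient matrix M."""
    img = np.asarray(img, dtype=float)
    i_y, i_x = np.gradient(img)  # derivatives along rows (y) and columns (x)
    a = box_sum(i_x * i_x, radius)
    b = box_sum(i_x * i_y, radius)
    c = box_sum(i_y * i_y, radius)
    # M = [[a, b], [b, c]]: eigenvalues are trace/2 +/- sqrt(((a-c)/2)^2 + b^2).
    half_trace = (a + c) / 2
    root = np.sqrt(((a - c) / 2) ** 2 + b ** 2)
    return half_trace + root, half_trace - root
```

Two large eigenvalues indicate a corner, one large and one near-zero eigenvalue indicates a straight edge, and two near-zero eigenvalues indicate a flat area.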
To compute the turning point of outer stroke endpoints, we use the following equation [35]: where $X_1$, $X_2$, $Y_1$, and $Y_2$ are the coordinates of the endpoints of the strokes; $x^*$ and $y^*$ are coordinates of the outermost points; and $x$, $y$ denote coordinates of every single point on the curve. The following equation is given to determine outer stroke points: where $P(x, y)$ denotes a single point at the x-axis and y-axis in the word image. Given the corner point $P(X_c, Y_c)$ along with its adjacent corner $P(X_n, Y_n)$, the height $h_{area}$ and width $w_{area}$ of a moving window are determined as: where $\alpha$ is a coefficient to normalize the area of the moving region among the corner points and is set between 0 and 1. Moreover, the moving area of outer strokes $P(X_o, Y_o)$ for the side length $s_{area}$ is given as: where $\beta$ is a coefficient to normalize the moving regions among the outer strokes and is set between 0 and 1. The final filtered localized textual regions are shown in Figure 4.

Keywords Grouping and Recognition
The localized textual regions in the previous steps consist of individual text characters. In order to recognize and understand the meaning of these textual regions, these individual characters must be combined into text lines. This way, the localized textual regions carry more meaningful information about the outer environment than the individual characters do. For example, a localized textual region containing the word "SCHOOL" is meaningful, whereas the unordered character set {C,O,L,O,S,H} loses the meaning of the word [36][37][38].
In order to form the ordered keywords, we employ the grouping approach [39]. The key idea is to apply a rectangle $\omega s_p \times h s_p$ for each connected region having the center $(x_p, y_p)$ and orientation $\theta_p$. Each associated region is considered to be a keyword candidate. The initial candidate regions having $w_1 = 0.4$ are refined to be the keyword with the following properties:
(1) The two adjacent textual candidates are associated with a new $w_i$ value.
(2) The achieved keyword candidate which is the combination of two candidates is obtained with curvilinear.
If the centers of connected regions in $\tau$ are estimated normally with a $k$th order polynomial, then the candidate keyword $\tau (\subset \zeta)$ is determined as curvilinear: where $(x'_p, y'_p)$ is the rotated point of $(x_p, y_p)$ and $s = \frac{1}{\tau} \sum s_p$ is the average score. The bounding boxes are applied to the character set of textual regions, as shown in Figure 5.
The grouped keywords from localized textual regions are further processed for recognition purposes. The cropped word images $I \in \mathbb{R}^{W \times H}$ having width $W$ and height $H$ consisting of the textual cues are recognized individually. The inputs are the 2D maps resulting in a $W \times H$ map for the individual character supposition.
Given the metrics $X$, $Y$, $Z$ and the confidence score for individual word supposition, $(\ldots, b^w_{L_w+1})$ represent the breakpoints amid individual characters, where $b^w_1$ initializes the first character and $b^w_{L_w}$ ends the last character. The breakpoint hypothesis $(w, b^w)$ for the word confidence score is given as: Each individual hypothesis word $w$ is optimized for breakpoints, and a word having an optimal score is recognized as: The unary fraction scores given in Equation (22) are determined with the following properties: the distance from outside the image boundaries, the distance from the estimated breakpoint location, the binary fraction score, the non-text class score, and the distance of the first and last breakpoints from the edge of the image. The pairwise score given in Equation (23) is determined with the following properties: non-text scores at character centers, character scores at midpoints amid breakpoints, eccentricity from the normalized character width, and active contributions of the left and right binary responses comparative to character scores.
The bounding boxes are applied to recognized words in order to match the evaluated breakpoints, and the recognized bounding boxes are added to the queue of recognized words. The recognized keywords are shown in Figure 6.

Textual Cues Keywords
The localized textual regions are optimized with the OCR and the formal words are recognized, thus providing a sensible meaning. In this step, we utilize the recognized formal words to establish a word model that will be responsible for sequences and boundaries. Since the recognized textual cues may still be missing some characters and may affect finding the desired locations, we employ an n-gram probabilistic language model that will provide evidence for the presence of the actual cues [40].
An n-gram model is generally used to predict the probability of a given n-gram in any contiguous sequence of words. A better n-gram model predicts the next word in a sentence. For example, given the word 'park', the first recognized trigram is 'par' and the second recognized trigram is 'ark'; their overlapping characters 'ar' suggest that the correctly recognized word is likely to be 'park'.
Given the word $w$ of length $N$ as a sequence of characters $w = (c_1, c_2, c_3, \ldots, c_N)$, where each $c_i \in C = \{1, 2, 3, \ldots, 36\}$ denotes the character at position $i$ in word $w$ from 26 letters and 10 digits, each recognized word has a varying length $N$ that can be determined at run time. Therefore, the number of characters in a single word is fixed to 22 with a null character and a maximum length class, which is given as: For two strings $s$ and $w \in C^M$, $s \subset w$ represents $s$ as a substring of the word $w$. An $N$-gram of $w$ is a substring $s \subset w$ having the length $|s| = N$. The dictionary of all grams of word $w$ of length up to $N$ is given as: As an example, the dictionary for the word 'cafe' is $G_3(\text{cafe}) = \{c, a, f, e, ca, af, fe, caf, afe\}$. Given the recognized $i$th n-gram $w_{n,i}$ and its consistent confidence score $c_{n,i}$, in order to determine the sequence of n-grams with the most confident prediction for the entire sequence of recognized words, the objective function can be given as: Here, $W_{n,i}$ is used to achieve the optimal n-gram separation of the given word, and each n-gram word image is recursively recognized.
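The gram dictionary and the trigram example from the text can be reproduced with a few lines of Python. This is a sketch of the character n-gram bookkeeping only, not of the recognition model; the function names are hypothetical.

```python
def ngrams(word, n):
    """Contiguous character n-grams of `word`, in left-to-right order."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def ngram_dictionary(word, n):
    """All substrings of `word` with length 1..n, i.e. the set G_n(word)."""
    return {gram for k in range(1, n + 1) for gram in ngrams(word, k)}


def overlap(first, second):
    """Longest suffix of `first` that is also a prefix of `second`."""
    for size in range(min(len(first), len(second)), 0, -1):
        if first[-size:] == second[:size]:
            return first[-size:]
    return ""
```

Here `ngrams("park", 3)` yields the two trigrams `'par'` and `'ark'`, `overlap("par", "ark")` yields `'ar'`, and `ngram_dictionary("cafe", 3)` yields the nine-element set listed above.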

Similarity Learning
Similarity learning retrieves the images whose textual cues match the user-input keywords [41][42][43]. The proposed intelligent system matches the user-input keywords with the outer environment textual cues. For this, we create feature vectors of the user-input keywords and of the recognized textual cues from the outer environment images.
Given the input keywords $Q$, the word $q_i$ is treated as a sequence of characters $(y_1, y_2, y_3, \ldots, y_{|q_i|})$, where $|q_i|$ denotes the total number of characters in word $q_i$, and $y_j$ is considered as the optimal representation of the $j$th character of word $q_i$. Each sequence is interpolated and concatenated into a fixed-length feature $\hat{f}_i \in \mathbb{R}^{T \times 2C}$, and all the features are signified as output features $\hat{F} \in \mathbb{R}^{N \times T \times 2C}$.
The recognized textual cue proposals $E \in \mathbb{R}^{K \times T \times C}$ and the input keywords $F \in \mathbb{R}^{N \times T \times C}$ are formed, and the similarity is computed as a similarity matrix $S(Q, P) \in \mathbb{R}^{N \times K}$ between the input keywords $Q$ and recognized textual cues. The score $\hat{S}_{i,j}(Q, P)$ between both the feature vectors $F_i$ and $E_j$ is given as: where $V$ represents the operator that converts the 2D matrix into a 1D vector. The predicted similarity matrix $\hat{S}(Q, P)$ is supervised by the target similarity matrix $S(Q, P)$. The target similarity $S_{i,j}(Q, P)$ is computed as the Levenshtein distance between corresponding textual pairs $(q_i, q_j)$ and is given as: Meanwhile, during implementation for the ranking, the similarity between the input keywords and recognized textual words equals the maximum value of $\hat{S}_{i,j}(Q, P)$.
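The target similarity can be sketched in plain Python. The normalization $1 - \mathrm{Lev}(a, b) / \max(|a|, |b|)$ used below is a common choice and is assumed here rather than taken from the paper; the function names and the case-insensitive comparison are likewise illustrative.

```python
def levenshtein(a, b):
    """Edit distance between strings `a` and `b` (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, 1):
        cur = [i]
        for j, ch_b in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                    # deletion
                           cur[j - 1] + 1,                 # insertion
                           prev[j - 1] + (ch_a != ch_b)))  # substitution
        prev = cur
    return prev[-1]


def target_similarity(a, b):
    """Similarity in [0, 1]; 1.0 means the two strings are identical."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def best_match(keyword, cues):
    """Highest-scoring recognized cue for one keyword, as (score, cue)."""
    scored = [(target_similarity(keyword.lower(), c.lower()), c) for c in cues]
    return max(scored, default=(0.0, None))
```

Taking the maximum over all cues mirrors the ranking rule stated above, where the similarity between a keyword and an image equals the largest pairwise score.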

Experimental Results and Discussion
This section presents evaluation results to show the efficiency and accuracy of the proposed intelligent system. The evaluation protocols and benchmark datasets used for experimentation are given as follows.

Datasets
As the proposed intelligent system is capable of finding and locating the desired locations using textual cues, we evaluated our system on four different benchmark datasets comprising outer environment images. In these datasets, different textual cues can be found and located for the autonomous vehicle's driving assistance system.
Street View Text (SVT): The dataset [44] consists of outer environment images in which textual cues appear on different objects such as walls, shops, billboards, buildings, etc. This dataset contains 250 training images and 100 test images, and image dimensions vary from 1024 × 768 to 1920 × 906.
ICDAR 2013: The dataset [45] comprises outer environment images in which textual cues can be found on multiple different objects such as shops, cafes, signboards, banners, posters, etc. This dataset contains 229 training images and 233 test images. Image dimensions vary from 350 × 200 to 3888 × 2592.
Total-Text: The dataset [46] contains curved, oriented, and horizontal textual cues that are very challenging to detect and recognize. The text appears on multiple objects present in the outer environment. The dataset contains 1255 training images and 300 test images, and image dimensions vary from 180 × 240 to 5184 × 3456.
MSRA-TD500: The dataset [47] contains challenging outer and inner environment images in which the textual cues appear on doorplates, caution plates, signs, boards, etc. This dataset contains 300 training images and 200 test images. Image dimensions vary from 1296 × 864 to 1920 × 1280.

Evaluation Measures
The evaluation measures for the proposed intelligent system are given as follows.

Textual Cues Evaluation
The detection and localization of the textual cues is one of the main components of a robust intelligent system. We evaluate the textual cues' detection and localization with standard evaluation protocols [48]: precision p, recall r, and f-measure f, defined as:
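As a hedged illustration, the three measures can be computed from counts of true positives, false positives, and false negatives using the standard definitions $p = TP/(TP+FP)$, $r = TP/(TP+FN)$, and $f = 2pr/(p+r)$. How a detection is matched to a ground-truth box (for example, by an overlap threshold) is left out of this sketch, and the function name is hypothetical.

```python
def detection_scores(tp, fp, fn):
    """Precision, recall, and f-measure from detection counts.

    tp: detections that match a ground-truth text region
    fp: detections with no matching ground-truth region
    fn: ground-truth regions that were not detected
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```

For example, 8 correct detections with 2 false alarms and 2 misses give a precision, recall, and f-measure of 0.8 each.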

Location Retrieval Evaluation
The user-input keywords are matched with the recognized textual cues and similar location images are retrieved for the necessary actions such as applying the brake. For location retrieval, the mean average precision (mAP) is a commonly used evaluation measure, which is the mean of the average precision over all queries. The mean average precision can be given as follows: where $A$ denotes the number of relevant locations, $B$ is the number of retrieved locations, and $R_k$ denotes the top similar images consisting of the same textual cues as the user-specified keywords. Given the set of user keywords $q_i \in Q$ as $\{w_1, w_2, w_3, \ldots, w_m\}$, where $Q$ denotes the set of all the keywords specified by the user, the mAP for the proposed system is formulated as: where $k$ is the number of retrieved location images having the most similar textual cues.
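A minimal sketch of the retrieval measure follows, using the widely used definition in which the average precision of one query is the mean of the precision values at each rank where a relevant image appears, and mAP is the mean over queries. Dividing by the number of relevant images found in the ranked list (rather than by all relevant images in the dataset) is an assumption of this sketch, as are the function names.

```python
def average_precision(ranked_relevance):
    """Average precision for one query.

    ranked_relevance: booleans in rank order, True where the retrieved
    image contains the queried keyword.
    """
    hits = 0
    total = 0.0
    for rank, relevant in enumerate(ranked_relevance, 1):
        if relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / hits if hits else 0.0


def mean_average_precision(per_query_relevance):
    """Mean of the per-query average precision values."""
    if not per_query_relevance:
        return 0.0
    scores = [average_precision(r) for r in per_query_relevance]
    return sum(scores) / len(scores)
```

A query whose relevant images sit at ranks 1 and 3 gets an average precision of (1/1 + 2/3) / 2, so pushing relevant images toward the top of the ranking raises the score.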

Implementation Results
In this section, we briefly describe the implementation details and discuss the output results. Firstly, the proposed intelligent system asks the driver to input the keywords of the desired location. For this purpose, we randomly select different keywords as input from the test images of all the datasets described in Section 3.1. Secondly, the proposed system detects and localizes the textual cues from the training images of each dataset and applies OCR recognition. Thirdly, similarity learning is applied to compute the similarity between the input keywords and recognized textual cues. Lastly, whenever any similar textual cue is found, the intelligent system informs the driver, slows down, and applies the brake to stop. The detailed experiments are given as follows.

Textual Cues Detection
Since textual cues detection is one of the key phases for a robust intelligent system, we first evaluate the textual cues detection on the benchmark datasets.
In this experiment, we evaluate the efficiency and accuracy of the proposed intelligent system for detecting and localizing the textual cues on the different datasets. We use the textual evaluation protocols: precision p, recall r, and f-measure f. The obtained results of the proposed system are given and compared with the state-of-the-art methods in Tables 1-4 for SVT, ICDAR'13, Total-Text, and MSRA-TD500 datasets, respectively. The proposed method outperformed state-of-the-art methods for the SVT and ICDAR datasets in textual cues detection and localization. For the Total-Text dataset, our proposed method achieved better precision as compared to state-of-the-art methods. For the MSRA-TD500 dataset, the proposed method achieved remarkable results with a better f score. Our main target is to detect the textual cues in low-quality road and street images, such as those in the SVT dataset, which are challenging to detect.

Locations Retrieval
This section describes the experiments on the datasets for finding the desired locations. To demonstrate the performance and accuracy of the proposed intelligent system, we conducted experiments on one-to-one and one-to-many location frames. The details of the experiments are as follows.
Experiment 1. One-to-one: In this experiment, the proposed intelligent system first asks the driver to input the keywords of the desired locations. Once the driver inputs one or more keywords, the proposed system creates a feature vector of those keywords and searches for a location image with the same textual cues. The feature vector of the input keywords is matched against each image in turn and produces a similarity score; whenever similar textual cues are found, the system asks the driver to confirm. The retrieval time for this experiment is much shorter than for Experiment 2, as the system rapidly compares the similarity and presents the outcome to the driver. The results of Experiment 1 are given in Table 5 for the SVT, ICDAR'13, Total-Text, and MSRA-TD500 datasets.
Experiment 2. One-to-many: To further show the efficiency and accuracy of the proposed intelligent system, we perform the one-to-many experiment. In this experiment, the proposed intelligent system takes input keywords from the driver for the desired location and retrieves the top-ranked location images possessing similar textual cues. The purpose of this experiment is to show the robustness of the proposed system by retrieving the top ten images with similar textual cues. The results of Experiment 2 are compared with the state-of-the-art methods in Table 6 for the SVT, ICDAR'13, Total-Text, and MSRA-TD500 datasets. The proposed method outperforms the majority of the previous methods in the one-to-many experiment. However, on the ICDAR'13 dataset, the proposed method could not compete with the method in [51], although it still performed well on the other datasets.
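The difference between the two experiments can be sketched in a few lines of code. The example below is only an illustration under stated assumptions: the feature description is replaced by a bag of character bigrams and the similarity by cosine similarity, neither of which is claimed to be the representation used by the proposed system. The function names, the threshold of 0.5, and the frame contents are hypothetical.

```python
from collections import Counter
from math import sqrt


def bigram_vector(words):
    """Bag of character bigrams over lower-cased words (a stand-in feature)."""
    counts = Counter()
    for word in words:
        w = word.lower()
        counts.update(w[i:i + 2] for i in range(len(w) - 1))
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def match_one_to_one(keywords, frame_words, threshold=0.5):
    """Score a single frame against the driver's keywords (assumed threshold)."""
    score = cosine(bigram_vector(keywords), bigram_vector(frame_words))
    return score, score >= threshold


def rank_one_to_many(keywords, frames, top_k=10):
    """Score every frame, sort by similarity, and keep the top_k."""
    query = bigram_vector(keywords)
    scored = [(cosine(query, bigram_vector(words)), name) for name, words in frames.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(name, score) for score, name in scored[:top_k]]


# Hypothetical recognized words per frame, for illustration only.
frames = {
    "frame_a": ["city", "pharmacy"],
    "frame_b": ["coffee", "house"],
    "frame_c": ["parking"],
}
score, is_match = match_one_to_one(["pharmacy"], frames["frame_a"])
ranking = rank_one_to_many(["pharmacy"], frames, top_k=2)
```

In the one-to-one case a single comparison decides whether to ask the driver for confirmation, whereas the one-to-many case must score and sort every candidate frame before returning the top-ranked ones.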

Retrieval Time Comparison
The time consumed in computing the textual cues and finding the similarity is a critical parameter. The proposed intelligent system keeps the textual features and discards the non-textual features during the localization step. The system maintains a good balance between textual and non-textual features. The retrieval time of the proposed system is compared for both experiments in Table 7. For the one-to-one experiment, the retrieval time is lower since the similarity is compared between only two entities, i.e., the driver-input keywords and the targeted outer environment location image. For the one-to-many experiment, the retrieval time is higher since the similarity is computed to index the top-ranked location images.

Results Impact and Discussion
In this section, we discuss and compare the results of the proposed approach with state-of-the-art methods. The details are given as follows.
Textual cues: Since textual cue detection plays a vital role in finding the semantic locations, we first compared the textual cue detection rate on four different datasets. The results for textual cue detection and localization on the SVT dataset are given in Table 1. The method in [54] has the highest precision value of 87.6. However, the proposed method outperformed state-of-the-art methods with the highest recall and f score.
Locations retrieval: For retrieving the semantic locations, we evaluated the proposed method in two different experiments. For the one-to-one mode, the proposed method achieved 69.2 mAP on SVT, 74.8 mAP on ICDAR'13, 59.1 mAP on Total-Text, and 63.4 mAP on MSRA-TD500 datasets. The mAP is limited by the textual cue localization step and can be improved if the localization of textual cues is further improved. For the one-to-many mode, the proposed method achieved 66.8 mAP on SVT, 75.6 mAP on ICDAR'13, 57.4 mAP on Total-Text, and 61.7 mAP on MSRA-TD500 datasets. Here, the closest competing method [49] has 63.0 mAP on SVT and 71.0 mAP on ICDAR'13. The method in [63] has the lowest mAP of 23.0 on SVT, and the method in [62] has 65.0 mAP on ICDAR'13. The proposed method achieved an overall better mAP across the datasets and outperformed the majority of state-of-the-art methods.
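For reference, mean average precision (mAP) can be computed from ranked retrieval lists as sketched below. This is a minimal sketch of one common definition, assuming binary relevance labels per retrieved image and normalizing by the number of relevant images that appear in the list; the exact protocol behind the reported mAP values is not specified here, and the function names and example rankings are hypothetical.

```python
def average_precision(relevance):
    """AP for one ranked list; relevance[i] is True if rank i+1 is a correct location."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    # Normalized by relevant items found in the list (an assumption).
    return precision_sum / hits if hits else 0.0


def mean_average_precision(ranked_lists):
    """Mean of the per-query average precision values."""
    return sum(average_precision(r) for r in ranked_lists) / len(ranked_lists)


# Two hypothetical keyword queries with their ranked results.
queries = [
    [True, False, True, False],   # AP = (1/1 + 2/3) / 2
    [False, True, False, False],  # AP = (1/2) / 1
]
m_ap = mean_average_precision(queries)
```

Since each correct image contributes the precision at its own rank, a correct location retrieved lower in the list reduces the score, which is consistent with the dependence of mAP on the quality of textual cue localization noted above.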
The proposed method certainly has a few limitations. Due to its simple feature description, it is suited to handling a relatively small amount of data rather than millions of images. Another disadvantage is that the method might not work well with more complex images, since it might not generalize to new types of data. The proposed approach, however, requires neither extensive training nor expensive hardware, so it is often easier to adopt and quicker to put into use. In our experiments, the proposed method performed better than the majority of the current methods and achieved an overall greater mean average precision score. The datasets included images from outer environment scenarios; however, the system's performance can be further improved under more complicated conditions such as low light, occlusion, or partial visibility of textual cues.

Conclusions
In this work, a novel intelligent system is proposed to sense the textual cues available in the outer environment for finding locations in autonomous vehicles. The proposed system first asks the driver to input the keywords of the desired locations. Next, the system proceeds with the detection and recognition of textual cues appearing on different objects such as billboards, shops, signboards, walls, buildings, banners, etc. Whenever the system finds a location whose keywords are similar to the driver's input keywords, the system notifies the driver, slows down, and applies the brake to stop. The experimental results on four datasets show the robustness of the proposed intelligent system for autonomous vehicles in sensing the textual cues appearing in outer environment scenarios. The proposed system has lower computational complexity and does not require any specific hardware. Owing to its simple feature description, the system does not require a large amount of training. In the future, we intend to improve the retrieval accuracy of the proposed system. We will further improve the methodology for live tracking and perform experiments on real-time scenarios with video frames.

Data Availability Statement: Data sharing is not applicable to this article.

Conflicts of Interest:
The authors declare no conflict of interest.